Introduction to R and Rstudio
Session - Cleaning data with {dplyr}
Artwork by @allison_horst
Wrangling
Is the reshaping or transforming of data into a format which is easier to work with
This is often the largest part of many analyses and data science
A note on tidy data
Tidyverse functions work best with tidy data:
Each variable forms a column.
Each observation forms a row.
(Broadly, this means long rather than wide tables)
{dplyr} package
{dplyr} is a language for data manipulation
Most wrangling puzzles can be solved with knowledge of just a few dplyr verbs or functions
Many of the concepts of these functions exist in SQL but dplyr (and other packages) can extend this further
6 functions/verbs to start with
6 key verbs will help us gain a deeper understanding of our data sets.
Note summarise() can also be spelt summarize()
dplyr:: arrange ()
dplyr:: filter ()
dplyr:: mutate ()
dplyr:: group_by ()
# followed often by
dplyr:: ungroup ()
dplyr:: summarise ()
Building with steps
These verbs aren’t used independently of each other.
Each can be a step in the code, like a recipe but can also be repeated.
A recipe starts with:
potato then
peel then
slice into medium sized pieces then
boil for 25 minutes then
mash
Recipe as code
The potato is the object in R terms and the steps are verbs or functions
Take a potato then peel thenslice into medium sized pieces thenboil for 25 minutes thenmash
potato |>peel() |>slice(size = "medium" |>boil(time = 25) |>mash()
Pipe
Shortcut key Ctrl+Shift+m
You might be familiar with the pipe |> which comes from the {magrittr} and is available through the {tidyverse} but the new pipe |> doesn’t require any packages to run
Q1. Which organisation provided the highest number of Mental Health (MH) beds?
arrange()
Reorder rows based on selected variable
beds_data |>
arrange (beds_av)
Descending data
We need descending order:
beds_data |>
arrange (desc (beds_av))
desc() works for text and numeric variables
Q2. Which 2 organisations provided the highest number of MH beds in September 2018?
We’ll use arrange() as before
But we require only observations with this date
filter()
The expression inside brackets should return TRUE or FALSE. We are choosing rows where this expression is TRUE.
beds_data |>
filter (date == "2018-09-01" )
A negative test of equality
To exclude and test where the expression is NOT equal !=
beds_data |>
filter (date != "2018-09-01" )
Ordered and filtered
beds_data |>
arrange (desc (beds_av)) |>
filter (date == "2018-09-01" )
Find the top 2 organisations
This isn’t a key function but useful
beds_data |>
arrange (desc (beds_av)) |>
filter (date == "2018-09-01" ) |>
slice_head (n = 2 )
Q3. Which 5 organisations had the highest percentage bed occupancy in September 2018?
We’ll use arrange() as before
We’ll use filter() as before
But we don’t have a percentage variable in the data
Create new variables
= in this context is an alias not a test of equality
beds_data |>
mutate (perc_occ = occ_av / beds_av) |>
filter (date == "2018-09-01" ) |>
arrange (desc (perc_occ))
Q4. What was the mean number of beds, (across all trusts) for each value of date?
Let’s first look at how we’d produce summary statistics like a mean
And then see how this can be applied to groups of data
summarise()
Collapses a single summary value
beds_data |>
summarise (mean_beds = mean (beds_av))
Missing values
We’ll need to remove NA values to get a suitable mean. TRUE can also be T
beds_data |>
summarise (mean_beds = mean (beds_av,
na.rm = TRUE ))
Have a go!
Instead of mean() use median()
object |>
summarise (new_name = function_name (column_name,
na.rm = ???))
Use a sum() statistic twice
object |>
summarise (col_1 = function_name (beds_av,
na.rm = ???),
col_2 = function_name (occ_av),
na.rm = ???)
)
Answer for summary statistics
median()
beds_data |>
summarise (per_occ = median (beds_av,
na.rm = TRUE ))
# A tibble: 1 × 1
per_occ
<dbl>
1 241
sum()
beds_data |>
summarise (total_beds = sum (beds_av, na.rm = TRUE ),
total_occupacy = sum (occ_av, na.rm = TRUE ))
# A tibble: 1 × 2
total_beds total_occupacy
<dbl> <dbl>
1 412480 368434
Applying summarise() to groups
Now we know how to use summarise (mean)
We’ll produce a summary value for each value of date
group_by() - persistent grouping
group_by() does nothing to the output alone.
The change occurs behind the scenes.
beds_data |>
group_by (date)
ungroup()
Seeing ungroup()
head() is the base R version of slice() and is unaffected by group_by()
beds_data |>
group_by (date) |>
head (1 )
beds_data |>
group_by (date) |>
ungroup () |>
slice_head (n = 1 )
Group by - in operation
The summarise() function currently relies upon the group_by() function to apply summary statistics to groups/sets but in a new release the reliance will be changed to be function specific. Currently code
beds_data |>
group_by (date) |>
summarise (mean_beds = mean (beds_av,
na.rm = TRUE ))
Q5. Which 5 organisations have the highest mean % bed occupancy?
For each of the organisations group_by()
Add 2 summary statistics, total_beds/total_occupancy to summarise() the data
mutate() the data to create a percentage using the previous
Order to find highest by using arrange()
Hint
beds_data |>
group_by () |>
summarise () |>
mutate () |>
arrange ()
Solution
beds_data |>
group_by (org_name) |>
summarise (total_beds = sum (beds_av, na.rm = TRUE ),
total_occupancy = sum (occ_av, na.rm = TRUE )) |>
mutate (perc_occ = total_occupancy / total_beds) |>
arrange (desc (perc_occ))